ClickBench Playground #904
alexey-milovidov wants to merge 227 commits
WIP checkpoint. Lets visitors run SQL against any of the 80+ ClickBench
systems via a single-page UI, each isolated in a per-system Firecracker
microVM.

- server/  aiohttp API: /api/systems, /api/state, /api/query,
           /api/admin/provision. Owns the per-system VM lifecycle,
           a 1-Hz CPU/disk/host-pressure watchdog, and a batched
           ClickHouse-Cloud logging sink (JSONL fallback).
- agent/   stdlib HTTP agent that runs inside each VM and wraps the
           system's install/start/load/query scripts.
- images/  scripts to build the base Ubuntu 22.04 rootfs + per-system
           rootfs/system-disk pair (200 GB sparse + 16/88 GB sized
           for the system's data format).
- web/     vanilla JS SPA — system picker, query box, X-Query-Time /
           X-Output-Truncated rendering.
Smoke-tested: base rootfs boots under Firecracker, agent comes up in
~2 s, /health and /stats respond. Agent self-test on the host (no VM)
covers all 4 endpoints including 10 KB output truncation. ClickHouse
provisioning is in flight; see playground/docs/build-progress.md for
the running checkpoint.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
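A minimal sketch of the /api/query surface described above, assuming aiohttp; the `pool` key and its `run_in_vm` call are illustrative placeholders for the VM-lifecycle owner, not actual identifiers from the branch:

```python
from aiohttp import web

async def api_query(request: web.Request) -> web.Response:
    body = await request.json()
    system, sql = body["system"], body["query"]
    # Forward the SQL to the chosen system's in-guest agent; `pool` is a
    # stand-in for whatever owns the per-system microVMs.
    result = await request.app["pool"].run_in_vm(system, sql)
    resp = web.Response(text=result.output)
    resp.headers["X-Query-Time"] = str(result.elapsed)
    if result.truncated:
        resp.headers["X-Output-Truncated"] = "1"
    return resp

app = web.Application()
app.router.add_post("/api/query", api_query)
```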
A later `umount -lR` on the chroot's /dev was propagating through the shared mount group and tearing down the host's /dev/pts, breaking sshd's PTY allocation. `--make-rslave` keeps mount events flowing *into* the chroot but blocks unmounts from leaking back to the host. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
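A sketch of the mount order this implies, assuming the image-build script binds /dev before chrooting (paths illustrative):

```python
import subprocess

def bind_chroot_dev(chroot: str) -> None:
    # rbind first, so the chroot sees the host's /dev tree incl. /dev/pts...
    subprocess.run(["mount", "--rbind", "/dev", f"{chroot}/dev"], check=True)
    # ...then mark the subtree rslave: host mount events still propagate in,
    # but a later `umount -lR` inside cannot tear down the host's /dev/pts.
    subprocess.run(["mount", "--make-rslave", f"{chroot}/dev"], check=True)
```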
A 16 GB guest snapshot.bin compresses to ~2 GB once we
1) stop+start the system daemon (sheds INSERT-time heap arenas and
buffers; the restarted daemon comes back with fresh allocator pages),
2) echo 3 > drop_caches (turns 3-5 GB of page cache into zero
pages),
3) zstd -T0 -3 --long=27 (parallel, big match window — most of
the savings come from those zero pages).
Restart is skipped for in-process engines where stop/start is a
no-op AND the data lives in the process; wiping it would defeat
the whole point.
The host now keeps snapshot.bin.zst as the canonical artifact and
decompresses on demand right before /snapshot/load. snapshot.bin
itself is deleted after a successful restore + teardown.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
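A host-side sketch of the compress step with the flags quoted above (the snapshot path is illustrative):

```python
import subprocess

def compress_snapshot(snapshot: str = "snapshot.bin") -> str:
    # -T0: all cores; -3: cheap level (the zeroed pages do the work);
    # --long=27: 128 MiB match window, so long zero runs fold together.
    subprocess.run(["zstd", "-T0", "-3", "--long=27", "-f", snapshot],
                   check=True)
    return snapshot + ".zst"  # canonical artifact; the .bin goes later
```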
The previous version threw away stdout/stderr from the pre-snapshot stop/start cycle, so a silent failure (`sudo clickhouse start` failing because the data dir was still locked by the dying daemon, etc.) left us with a snapshot of a dead clickhouse-server — restored VMs then returned "Connection refused (localhost:9000)" on every query and the only way to recover was to manually delete the snapshot. Capture stdout+stderr into the provision log so the failure mode is visible via GET /provision-log, and refuse to mark PROVISION_DONE if ./check doesn't recover within the timeout. The host then sees /provision return 500 and skips the snapshot step entirely. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
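A sketch of the capture-and-log shape; `run_logged`, `wait_for_check`, and the log path are hypothetical names, only the capture_output idea and the refuse-to-mark-done rule come from the commit:

```python
import subprocess, time

PROVISION_LOG = "/var/lib/clickbench-agent/provision.log"  # illustrative

def run_logged(cmd: list[str]) -> None:
    # Keep stdout+stderr so a silent `sudo clickhouse start` failure is
    # visible via GET /provision-log instead of vanishing.
    proc = subprocess.run(cmd, capture_output=True, text=True)
    with open(PROVISION_LOG, "a") as log:
        log.write(f"$ {' '.join(cmd)} (rc={proc.returncode})\n")
        log.write(proc.stdout)
        log.write(proc.stderr)

def wait_for_check(timeout: float = 300.0) -> bool:
    # PROVISION_DONE is only written if ./check recovers in time; otherwise
    # the host sees /provision return 500 and skips the snapshot step.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if subprocess.run(["./check"]).returncode == 0:
            return True
        time.sleep(2)
    return False
```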
PROVISION_DONE lives on the rootfs disk (/var/lib/clickbench-agent/),
which persists across VM cold-boots. So on the second provision after
the host deleted the snapshot files, the agent saw PROVISION_DONE
already set and returned "already provisioned" — but the daemon
itself wasn't running (cold boot, no clickhouse-server in systemd),
so the host snapshotted an empty VM and every restored query came back
with "Connection refused (localhost:9000)".
Two fixes:
1. Agent: on every startup, if PROVISION_DONE is set, kick ./start
in a background thread. start is idempotent for the systems that
have a daemon, so it costs nothing when the daemon is already up
(post-restore) and brings it up when the rootfs is being re-used
across a cold reboot.
2. Host: when (re-)provisioning a system with no snapshot, drop the
existing rootfs.ext4 so install/start/load run fresh. The
system.ext4 (which holds ~14 GB of pre-staged dataset) is preserved.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
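A sketch of fix 1; `_kick_daemon_if_provisioned` is the name later commits use, and the marker path follows the message above:

```python
import os, subprocess, threading

MARKER = "/var/lib/clickbench-agent/PROVISION_DONE"

def _kick_daemon_if_provisioned() -> None:
    # Runs at agent startup. ./start is idempotent for daemon systems: a
    # no-op after a snapshot restore, a real bring-up after a cold boot
    # on a reused rootfs.
    if os.path.exists(MARKER):
        threading.Thread(target=subprocess.run, args=(["./start"],),
                         daemon=True).start()
```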
The cloud image ships hostname=ubuntu but /etc/hosts only maps 'localhost' to 127.0.0.1. Every sudo invocation inside the VM then tries to reverse-resolve 'ubuntu' against the network — which has no DNS after the snapshot drops internet — and pays the ~2 s resolver timeout. With several sudos per ./query, that's a multi-second floor on every query, visible in the firecracker log as repeated 'sudo: unable to resolve host ubuntu: Name or service not known'. Mapping ubuntu to 127.0.0.1 short-circuits the lookup. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The mid-snapshot checksum mismatch I attributed to "stopping the
daemon mid-merge" was actually FS corruption: KVM pauses the vcpus
the moment we call /vm Paused, and any ext4 writeback that was in
flight at that instant gets captured by the snapshot as half-flushed.
On restore the page cache references on-disk blocks that never landed,
and the next read sees a torn write.
Fix:
1. Drop the pre-snapshot stop/start. Killing ClickHouse at any
point never corrupts on-disk MergeTree data — only an unflushed
FS can.
2. Add a /sync endpoint to the agent and call it from the host
right before /vm Paused, so all dirty pages have hit virtio-blk
before KVM freezes the vcpus.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
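The guest half of fix 2 is essentially one syscall; a sketch assuming the agent's stdlib http.server framing:

```python
import os
from http.server import BaseHTTPRequestHandler

class AgentHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path == "/sync":
            # sync(2) returns only after writeback reaches virtio-blk, so
            # the host can pause the vcpus immediately afterwards.
            os.sync()
            self.send_response(200)
            self.end_headers()
```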
Now that the host /syncs the FS before pausing the vcpus, the snapshot captures consistent on-disk state regardless of when the daemon exits (MergeTree's on-disk format is durable under arbitrary process exit; only an unflushed *filesystem* corrupts it). So we can shut the daemon down here to evict its private heap (merge thread arenas, query cache, mark cache, uncompressed cache, ingest buffers) and snapshot what's left — mostly zero-fill RAM, which zstd compresses ~300:1. Restore path is unchanged: _kick_daemon_if_provisioned at agent startup brings the daemon back up on every cold restore. First query in a restored VM pays a 1-2 s daemon-start cost instead of carrying 8-12 GB of memory in every snapshot. In-process engines (chdb, polars, …) keep all state in RAM and have no daemon to stop; for them, has_daemon is false and we skip the stop step, falling back to drop_caches alone. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two changes for the small-snapshot path:
1. Pass init_on_free=1 in the guest kernel cmdline. Linux normally
leaves freed page frames with whatever bytes were last written to
them, so the post-`clickhouse stop` free pool was ~10 GB of stale
daemon heap and Firecracker's snapshot dump compressed only ~3:1.
init_on_free=1 zeros every page as it goes onto the free list, so
the snapshot's RAM region is genuinely zero-filled and zstd hits
~300:1.
2. Add `_ensure_daemon_started` at the top of the agent's /query
handler. After a snapshot restore (taken with the daemon stopped),
the restored memory has no daemon process and `localhost:9000`
refuses connections. The cold-boot `_kick_daemon_if_provisioned`
only fires on actual cold boots, not on snapshot resumes, so we
need an explicit check at query time. Lock-protected so concurrent
/query requests don't try to ./start the daemon twice; idempotent
and free once the daemon is up.
Also dropped the userspace _zero_free_ram hack — init_on_free does
it natively at no userspace cost.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
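A sketch of the lock-protected check in fix 2; the `_daemon_ok` probe is an illustrative stand-in for the agent's actual liveness test:

```python
import socket, subprocess, threading

_start_lock = threading.Lock()

def _daemon_ok(port: int = 9000) -> bool:
    try:
        socket.create_connection(("localhost", port), timeout=1).close()
        return True
    except OSError:
        return False

def _ensure_daemon_started() -> None:
    # Concurrent /query requests serialize here so only one ./start runs;
    # idempotent and nearly free once the daemon is up.
    with _start_lock:
        if not _daemon_ok():
            subprocess.run(["./start"], check=True)
```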
End-to-end working with a 35 MB snapshot (16 GiB raw, ~470x ratio): SELECT COUNT(*) returns 99997497 cleanly, GROUP BY URL produces the expected top-N without any checksum errors, output truncation caps a 244 KB result at 10 KB with the right header set. Cold path (snapshot restore + daemon start): ~10 s. Warm path (live VM): subsecond on COUNT / MIN-MAX. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Four correctness/efficiency fixes:

1. Shared read-only datasets disk. Previously each per-system rootfs
   embedded its own copy of hits.parquet / hits.tsv / hits.csv
   (14-75 GB each), so the catalog needed ~1-2 TB of redundant dataset
   storage on the host. Build one shared datasets.ext4 instead, attach
   to every VM read-only at LABEL=cbdata, and have the agent copy the
   bytes the system actually needs from /opt/clickbench/datasets into
   the writable per-system disk at provision time only. The agent uses
   os.copy_file_range so the in-VM copy is kernel-side, not bounced
   through userspace.

2. Golden-disk snapshot/restore. Firecracker's snapshot.bin only saves
   memory; the disk image referenced by the in-memory state is the live
   file. If anything modifies it between snapshots (background merges,
   log writes, /tmp churn) the next /snapshot/load points at the new
   disk while replaying old memory references. We were getting away
   with this because clickhouse-server happens to be tolerant, but it's
   fragile. Now /snapshot also renames the working disks into
   `*.golden.ext4`, and /restore-snapshot clones the goldens back into
   fresh working copies via `cp --sparse=always`. Every restore starts
   from the exact disk state captured at snapshot time.

3. Bound per-system disk builds and provisions via asyncio.Semaphore
   (PLAYGROUND_BUILD_CONCURRENCY=6, PLAYGROUND_PROVISION_CONCURRENCY=32)
   so kicking off 98 systems at once doesn't thrash the host NVMe or
   rate-limit Ubuntu mirrors.

4. Re-enabled `ursa` in the playground catalog (was incorrectly in the
   _EXTERNAL exclude list; it runs locally).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
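The kernel-side copy in fix 1 reduces to a copy_file_range loop; a sketch with illustrative paths and chunking:

```python
import os

def stage(src: str, dst: str, chunk: int = 1 << 30) -> None:
    # os.copy_file_range moves bytes fd-to-fd inside the kernel; nothing
    # is bounced through a userspace buffer.
    with open(src, "rb") as fsrc, open(dst, "wb") as fdst:
        remaining = os.fstat(fsrc.fileno()).st_size
        while remaining > 0:
            n = os.copy_file_range(fsrc.fileno(), fdst.fileno(),
                                   min(chunk, remaining))
            if n == 0:
                break
            remaining -= n
```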
The previous design copied dataset files from the read-only cbdata mount into the per-VM writable cbsystem disk on every provision — 14 GB for parquet systems, 75 GB for tsv/csv. That worked but was redundant: the data is already on a read-only mount; the only reason we copied was that ClickBench's load scripts do `sudo mv` and `sudo chown` on the dataset files.

Use overlayfs instead:

  lowerdir = /opt/clickbench/datasets_ro (RO, the shared image)
  upperdir = /opt/clickbench/system_upper (RW per-VM disk with scripts)
  merged at /opt/clickbench/system

The system's load runs at cwd=/opt/clickbench/system. It sees scripts + dataset files in one tree. When it `mv`s or `chown`s a file from the lower, overlayfs does a lazy copy-up: only the file's bytes get materialised into the upper, and only when the script actually mutates it. Most ClickBench load scripts `rm` the dataset file after INSERT, which becomes a whiteout in the upper — a few bytes of metadata, not a 75 GB copy.

Saves ~1-2 TB across the catalog on host disk (no per-system copies) *and* eliminates the per-provision in-VM stage. Only cost: small metadata to maintain the overlay (kilobytes).

For partitioned parquet, the source files live in datasets_ro/hits_partitioned/ but the load globs cwd/hits_*.parquet, so the agent creates symlinks in the upper pointing at the lower — ~100 symlinks, a few hundred bytes total.

Also: make build-datasets-image.sh idempotent. The 173 GB rsync into datasets.ext4 only needs to run when the source dir's mtime has changed; otherwise the cached image is reused.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
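A sketch of the overlay assembly; overlayfs also requires a workdir on the same filesystem as the upper, which the commit doesn't name, so `system_work` is an assumption:

```python
import subprocess

def mount_system_overlay() -> None:
    subprocess.run([
        "mount", "-t", "overlay", "overlay",
        "-o", ("lowerdir=/opt/clickbench/datasets_ro,"
               "upperdir=/opt/clickbench/system_upper,"
               "workdir=/opt/clickbench/system_work"),  # workdir assumed
        "/opt/clickbench/system",
    ], check=True)
```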
Two fixes for the parallel-provisioning-98-systems path:

1. The _build_sem and _provision_sem fields were defined but never
   acquired — `provision-all.sh` kicked all 98 provisions at once and
   they each independently spawned build-system-rootfs.sh, which tried
   to write ~8 GB of rootfs base content × 98 in parallel (~780 GB of
   writes against a single NVMe). Disk got saturated and nothing
   finished. Use `async with self._build_sem:` and
   `async with self._provision_sem:` around the heavy phases.

2. build-system-rootfs.sh now clones the base image at block level with
   `cp --sparse=always` and resizes the filesystem to 200 GB in place,
   instead of mkfs.ext4 + mount + rsync-of-base-contents. The
   block-level clone touches only the ~2 GB of non-zero blocks in the
   base, vs. the rsync approach traversing the mounted base and writing
   every file individually. Per-system rootfs build goes from ~30 s to
   ~3 s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
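A sketch of fix 1's shape; `_build_disks` / `_run_provision` are stand-ins for the heavy phases:

```python
import asyncio, os

class Provisioner:
    def __init__(self) -> None:
        self._build_sem = asyncio.Semaphore(
            int(os.environ.get("PLAYGROUND_BUILD_CONCURRENCY", "6")))
        self._provision_sem = asyncio.Semaphore(
            int(os.environ.get("PLAYGROUND_PROVISION_CONCURRENCY", "32")))

    async def provision(self, system: str) -> None:
        async with self._build_sem:       # at most 6 concurrent builds
            await self._build_disks(system)
        async with self._provision_sem:   # at most 32 live provisions
            await self._run_provision(system)

    async def _build_disks(self, system: str) -> None: ...    # stand-in
    async def _run_provision(self, system: str) -> None: ...  # stand-in
```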
Previously the agent created symlinks in the overlay's upper for partitioned parquet (hits_partitioned/* -> upper/hits_*.parquet) because the source directory was nested. That fell apart on clickhouse's load: `mv hits_*.parquet /var/lib/clickhouse/user_files/` moved the symlinks, and the subsequent `chown` followed them through to the read-only datasets disk and got `Read-only file system`. Flatten the dataset image so all 100 partitioned parquet files sit at the root next to hits.parquet / hits.tsv / hits.csv. The overlay then exposes them directly at /opt/clickbench/system as real files, no symlinks involved. clickhouse's `mv` becomes a real copy-up (and the source becomes a whiteout in upper), and the subsequent `chown` operates on a regular file on the rootfs — works. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 2 GB cap on the per-VM system disk was a holdover from the in-VM-copy era, when system.ext4 only held scripts + staged data. Once we switched to overlay-with-RO-datasets, system.ext4 also holds the overlay's upperdir + workdir — i.e. every byte the load script writes lands there, including the database's own files. ClickHouse writes ~5 GB of MergeTree parts, DuckDB ~6 GB, Hyper ~10 GB; chown on partitioned parquet copies up another 14 GB. 2 GB was always going to overflow. Match the rootfs at 200 GB (apparent). The file is sparse: truncate reserves the size but allocates no physical blocks, mkfs.ext4 writes ~50 MB of metadata, and the snapshot/restore path uses `cp --sparse=always` so only the bytes the VM actually wrote land on the host disk. Light systems (chdb, sqlite, ...) cost the host near nothing; heavy ones (tidb at ~137 GB, postgres-indexed ~80 GB) fit without hitting ENOSPC mid-load. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Each per-system rootfs build was running `e2fsck -fy` on its clone before `resize2fs`. With 98 systems and ~5 s per fsck of a 200 GB sparse file, that's ~8 minutes of pure disk thrash during catalog build — and entirely redundant: the base ext4 is built fresh and never mounted dirty, so the bit-for-bit clone is clean too. Move the single fsck to the end of build-base-rootfs.sh (where it has all the host's I/O to itself) and skip it in the per-system loop. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The base ext4 used to be built at 8 GB and each per-system rootfs clone ran resize2fs to grow to 200 GB. resize2fs on a 200 GB file is disk-heavy (it has to write group descriptor and bitmap metadata for every additional block group), and we did it 98 times in parallel. Build the base directly at 200 GB sparse with lazy_itable_init=1,lazy_journal_init=1. mkfs writes ~50 MB of superblock + GDT material upfront and defers the rest to lazy background init, so the image file's physical footprint is unchanged from the previous 8 GB layout (~1.8 GB). Per-system clones then need only `cp --sparse=always`: no resize2fs, no e2fsck, ~1 second each. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
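The build/clone path in sketch form (flags from the commits above, paths illustrative):

```python
import subprocess

def build_base(image: str = "base-rootfs.ext4") -> None:
    # Reserve 200 GB of apparent size; no physical blocks allocated.
    subprocess.run(["truncate", "-s", "200G", image], check=True)
    # Lazy init defers block-group bookkeeping, so mkfs writes ~50 MB.
    # -F: the target is a regular file, not a block device.
    subprocess.run(["mkfs.ext4", "-F", "-E",
                    "lazy_itable_init=1,lazy_journal_init=1", image],
                   check=True)

def clone_rootfs(base: str, dst: str) -> None:
    # Block-level, hole-preserving per-system clone: no resize2fs,
    # no e2fsck, ~1 second.
    subprocess.run(["cp", "--sparse=always", base, dst], check=True)
```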
`umount` already syncs the filesystem being unmounted. The host-wide `sync` we were calling first flushes every dirty page on *every* mount — under 98-way parallel builds, each build's sync blocked on every other build's writeback, multiplying the wall-clock cost. Drop them. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…olden When clickhouse's load `mv hits.parquet /var/lib/clickhouse/user_files/` (or any cross-FS move) copies the 14-75 GB dataset into the writable per-VM disk and then `rm`'s it after INSERT, ext4 marks those blocks free but the underlying virtio-blk file still carries the bytes. `cp --sparse=always` on the golden then preserves them as random data, so the per-system snapshot for a parquet engine carried a full extra copy of the dataset that the load already discarded. Adding `fstrim /opt/clickbench/sysdisk` and `fstrim /` before the host's snapshot makes the guest issue DISCARD for free blocks; the host loop driver responds by punching holes in the sparse backing file (linux loop devices advertise discard with PUNCH_HOLE since 4.x, which firecracker's virtio-blk passes through). The golden then holds only the bytes the engine actually keeps. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
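A sketch of the pre-snapshot trim (-v only so the discard totals land in the provision log):

```python
import subprocess

def trim_before_snapshot() -> None:
    # DISCARD on free blocks lets the host loop driver punch holes in
    # the sparse backing file.
    for mnt in ("/opt/clickbench/sysdisk", "/"):
        subprocess.run(["fstrim", "-v", mnt], check=True)
```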
Several systems' load scripts do `sudo mv hits_*.parquet /var/lib/<engine>/user_files/` or `sudo cp hits.csv .../extern/` followed by `chown` to the daemon's user. The mv/cp copies 14-75 GB of data the daemon reads once during INSERT and we delete right after — a complete waste of bytes on disk and time on the wire. Replace with `ln -s` + `chown -h` where the daemon's user-files dir is on a different filesystem from the dataset. `chown -h` chowns the symlink itself rather than following into the (often read-only) original; the underlying dataset is mode 644 anyway, so daemon processes can read through the symlink as their own user. Systems updated: clickhouse, clickhouse-tencent, pg_clickhouse, kinetica, oxla, ursa, arc, cockroachdb. Motivated by the ClickBench playground (Firecracker microVM service) where the dataset is mounted read-only and shared across all VMs; the copy step was the dominant cost on parquet/csv-format systems and pulled 14 GB into the per-VM snapshot golden disk unnecessarily. The change is also benign for the regular benchmark — daemons still read the same bytes, just through a symlink. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
8080 is the default HTTP admin port for cockroach, the spark UI, trino, presto, druid, and a long tail of other JVM-based databases in the catalog. Our in-VM agent was binding it first, so when their ./start ran the daemon failed with "bind: address already in use" and the whole provision came down with a port conflict. Pick 50080 — uncommon enough that no ClickBench engine in the current catalog wants it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Several systems' load scripts call ../lib/download-hits-* — e.g. doris-parquet expects `download-hits-parquet-partitioned <doris_be_dir>` to materialize the dataset in a specific subdirectory of the BE's working tree. Previously we copied the lib tree into /opt/clickbench/system/_lib, but ../lib from the system dir resolves to /opt/clickbench/lib, not /opt/clickbench/system/_lib. Put 4 stub scripts (one per format) at /opt/clickbench/lib in the base rootfs. Each one symlinks from the shared RO dataset mount into the target directory — same interface as upstream's wget-based scripts, but instant and zero-byte-on-disk. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The firecracker-ci kernel is minimal: it boots fine, but Docker fails to start because it lacks iptables/nat, br_netfilter, veth and other modules that Docker needs to set up its bridge network. That killed ~6 Docker-using systems (byconity, cedardb, citus, cloudberry, greenplum) in the parallel provisioning run. Swap in Ubuntu's `linux-image-generic` kernel (the same one Ubuntu ships for cloud KVM guests). It has every Docker-required module plus a much richer driver set, while still booting under Firecracker. Trade-off: it lacks CONFIG_IP_PNP so the kernel's `ip=` boot arg is ignored. Add a tiny clickbench-net.service that parses `ip=` from /proc/cmdline and applies it to eth0 at boot; agent.service waits for it. The same rootfs continues to work with the firecracker-ci kernel (the systemd unit's `ip addr add` is idempotent — kernel-set IPs are already there). Verified: smoke-boot agent answered in 3 s on the new kernel. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
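A sketch of what the clickbench-net unit has to do, assuming the kernel's ip=<client>:<server>:<gw>:<netmask>:<hostname>:<device>:<autoconf> syntax, with error handling pared down:

```python
import ipaddress, subprocess

def apply_cmdline_ip() -> None:
    args = open("/proc/cmdline").read().split()
    spec = next((a[3:] for a in args if a.startswith("ip=")), None)
    if not spec:
        return
    client, _srv, gw, netmask, _host, dev, _auto = \
        (spec.split(":") + [""] * 7)[:7]
    prefix = ipaddress.IPv4Network(f"0.0.0.0/{netmask}").prefixlen
    # Idempotent by intent: if a CONFIG_IP_PNP kernel already applied
    # ip=, `ip addr add` reports EEXIST and we carry on.
    subprocess.run(["ip", "addr", "add", f"{client}/{prefix}", "dev", dev])
    subprocess.run(["ip", "link", "set", dev, "up"], check=True)
    subprocess.run(["ip", "route", "add", "default", "via", gw])
```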
The Ubuntu generic kernel builds overlay, veth, br_netfilter, iptable_nat, nf_conntrack and friends as loadable modules, not built-in. Without /lib/modules/<ver>/ in the rootfs the kernel can't load them at runtime — the immediate symptom was `Failed to mount /opt/clickbench/system` (overlayfs not available) and Docker still failing to start (no br_netfilter/iptable_nat). Drop the linux-modules-7.0.0-15-generic deb into the chroot, `dpkg --unpack` it into the rootfs, run `depmod`, and pre-load the critical modules via /etc/modules-load.d/clickbench.conf so they're ready before any service starts. The image grew from 1.8 to 2.0 GB physical (200 GB apparent) — modules add ~200 MB. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`dpkg --unpack` records the modules package in dpkg's status DB without configuring it; subsequent `apt-get install` calls inside every per-system VM see an unconfigured package with unmet dependencies and bail with "Unmet dependencies. Try 'apt --fix-broken install'". That broke ~10 systems in the previous parallel run. Switch to `dpkg-deb -x` — extracts the data tarball into the rootfs without touching dpkg's DB. apt sees a normal system with all modules in /lib/modules/, and the kernel can load them at runtime. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Snapshot of the state after the 10th parallel run. Documents:
- what works end-to-end (microVM lifecycle, shared RO datasets disk,
per-restore disk hygiene, fstrim before snapshot, Ubuntu kernel
with modules)
- bug fixes pushed during the run (port 8080 conflict, mv→ln -s,
download-hits stubs, build/provision semaphores, redundant fsck/
resize2fs/sync removed, clickbench-net.service, kernel module
preload, 200 GB system disk for heavy systems)
- failure categories observed
- what's left for the long tail
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three independent failures observed in the 10th parallel run:
1. The 7 pg_* systems (pg_clickhouse, pg_duckdb*, pg_ducklake,
pg_mooncake) all failed to spawn firecracker with
`Firecracker panicked at main.rs:296: Invalid instance ID:
InvalidChar('_')`. Firecracker's --id rejects underscores. Map
`_` to `-` for the fc id (the system name itself stays intact).
2. duckdb / chdb-dataframe / duckdb-dataframe OOM-killed at 16 GB
("Out of memory: Killed process 578 (duckdb) anon-rss:15926176kB").
DuckDB and chdb hold the full dataset in memory during INSERT;
16 GB just isn't enough for the 100 M row hits set. Bump default
VM memory to 32 GB. KVM allocates lazily, so 98×32 GB on the host
is fine.
3. monetdb's install fails with `$USER: unbound variable`. systemd's
default service env has no USER/LOGNAME. Stamp them as root in
clickbench-agent.service so subprocess.run inherits them.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ClickBench: fix elasticsearch load.py bytes/str mix
VM tweaks for the long tail of failures:
- chdb-dataframe / duckdb-dataframe materialize the full hits dataset
in process memory and need >32 GB. Default to 48 GB.
- Druid / Pinot / similar JVM stacks take 5-10 min to come up
(Zookeeper → Coordinator → Broker → Historical, in sequence). The
agent's 300 s check-loop wasn't enough; widen to 900 s.
elasticsearch/load.py: gzip.open in mode='rt' returns str docs, but
bulk_stream yields bytes for ACTION_META_BYTES and str for the doc.
requests.adapters.send() calls sock.sendall() on the mixed iterable
and crashes with `TypeError: a bytes-like object is required, not
'str'`. Open in 'rb' so docs are bytes — matches the rest of the
generator.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
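The fixed generator in sketch form; the action-line constant is illustrative, the 'rb' open is the commit's fix:

```python
import gzip

ACTION_META_BYTES = b'{"index":{}}\n'  # illustrative bulk action line

def bulk_stream(path: str):
    # 'rb', not 'rt': text mode would yield str docs, and mixing str with
    # the bytes action line crashes sock.sendall() downstream.
    with gzip.open(path, "rb") as fh:
        for doc in fh:
            yield ACTION_META_BYTES
            yield doc
```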
chdb-dataframe, duckdb-dataframe, polars-dataframe, daft-parquet, daft-parquet-partitioned load the whole hits dataset into a single in-process DataFrame. Observed peak RSS is 80-100 GB on the partitioned parquet set — even though KVM allocates lazily, sustaining that working set for shared use isn't feasible. Disable them in the registry rather than bump RAM for everyone. Revert the default per-VM RAM cap to 16 GB. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
duckdb-memory's load OOM'd at 16 GB anon-rss — it's the same RAM-resident model as duckdb-dataframe/chdb-dataframe, just packaged as its own ClickBench entry. Add to the disabled-systems list. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Firecracker CI kernel (vmlinux-6.1.141) does not include CONFIG_NF_TABLES — every nft call inside the VM returns 'Failed to initialize nft: Protocol not supported'. Ubuntu 24.04 defaults `update-alternatives --display iptables` to the nft variant, and dockerd's bridge-driver startup calls `iptables -t nat -N DOCKER`. The nft failure aborts dockerd → docker.service exits 1/FAILURE → every docker-based system fails at install time with Cannot connect to the Docker daemon at unix:///var/run/docker.sock The legacy backend uses ip_tables / iptable_nat / xt_* modules which the firecracker kernel does compile in (and the modules-load.d hook here pre-loads). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ts_partitioned/ build-datasets-image.sh rsyncs /opt/clickbench-playground/datasets/ verbatim, so the partitioned parquet files end up at /opt/clickbench/datasets_ro/hits_partitioned/hits_N.parquet inside the VM. The lib stub was linking from /opt/clickbench/datasets_ro/hits_N.parquet (no subdir) — every symlink dangled and every partitioned-parquet load script failed with 'No files found that match the pattern "hits_*.parquet"'. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
After fixing the iptables-nft → legacy default, dockerd installs and starts cleanly. `docker run` then fails with: iptables v1.8.10 (legacy): can't initialize iptables table 'raw': Table does not exist (do you need to insmod?) because modprobe doesn't auto-load every iptables filter table on demand inside a stripped-down firecracker rootfs. dockerd's DIRECT ACCESS FILTERING uses the `raw` table; we already pre-load `iptable_nat`, so add `iptable_raw`, `iptable_filter`, `iptable_mangle`, and `xt_conntrack` to the list. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…p layout These three were added to the playground before being rewritten in the PR #860 split-benchmark.sh refactor — they still carried the old monolithic benchmark.sh + run.sh. Replace benchmark.sh with the thin shim that sources lib/benchmark-common.sh, drop run.sh, and add a data-size script measuring fb-volume (the bind-mounted firebolt-core data directory). install/start/check/load/query/stop already existed from when we wrote them per-step originally; this only catches the metadata files up. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
dockerd 28+ added "DIRECT ACCESS FILTERING": iptables -t raw DROP rules to block traffic going directly to container IPs. The Firecracker CI kernel doesn't compile in CONFIG_IP_NF_RAW, so 'iptables -t raw -A PREROUTING' fails with 'Table does not exist' and 'docker run' on the default bridge exits 125. Write /etc/docker/daemon.json setting the bridge driver's gateway_mode_ipv4/ipv6 = nat-unprotected. Container traffic still masquerades via the `nat` and `filter` tables (which the kernel does have); we lose the extra "host-bypass DROP" layer that's fine to skip in a sandboxed single-container microVM. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The gateway_mode_ipv4=nat-unprotected attempt didn't take effect for the auto-created default `bridge` network on docker.io 29.x — every docker run still tries to insert a `raw`-table DROP rule and fails with 'Table does not exist'. Set iptables=false in daemon.json: dockerd stops touching iptables altogether, port forwarding goes through the userland docker-proxy (which works fine for our single-container-per-VM use case), and the host-side net.enable_filtered_internet path still handles VM→upstream masquerade. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…heck timeout The 10-min wait-for-ready loop just printed "firebolt-core did not become healthy in 10 min" with zero context, so subsequent re-kicks were blind. Add docker ps / inspect / logs / ss listener / curl probe on the failure path so the provision log carries enough to triage. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The diagnostic dump showed firebolt-core refusing to start with: The directory '/firebolt-core/volume/' (owner 0:0, permissions 755) is not readable or writeable by the Firebolt Core process (running as effective user 1111, effective group 1111). The agent provisions as root, so the bind-mounted host dir lands as root:root; firebolt-core inside the container is uid 1111 and won't initialize the engine. chown the host-side dir to 1111:1111 before docker run. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… setup Reflink + transparent zstd are both native on btrfs, so the two-phase reflink-then-zstd snapshot dance is no longer needed: revert _snapshot_disks/_restore_disks to plain reflink and let the filesystem handle compression. Update install-firecracker.sh to document mkfs.btrfs + compress=zstd:1 as the recommended host setup; XFS still works for reflink but lacks compression and fills the host at ~7 TB. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Upstream ClickBench keeps the 100 hits_N.parquet partitioned files under hits_partitioned/; load scripts glob `hits_*.parquet` at cwd, not from a subdir. The agent relies on overlay magic for staging (lowerdir=datasets_ro, cwd=/opt/clickbench/system), and that surfaces files at root of the dataset image but leaves hits_partitioned/ as a subdir — the glob then matches nothing. Symptom: clickhouse / pg_clickhouse / ursa / daft-parquet-partitioned / duckdb-parquet-partitioned / duckdb-vortex-partitioned all hit 'No files found that match the pattern "hits_*.parquet"' (or the dialect-specific equivalent) at load time. Materialise the per-file symlinks in cwd in the agent rather than in each system's load script so the 6+ partitioned consumers don't each reimplement the same staging step (which historically rotted when one or two were updated and the rest weren't — upstream centralised this in lib/download-hits-* for the same reason). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
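The centralised staging step is small; a sketch using the paths from this commit and an illustrative function name:

```python
import glob, os

def stage_partitioned(cwd: str = "/opt/clickbench/system") -> None:
    src_dir = "/opt/clickbench/datasets_ro/hits_partitioned"
    # ~100 symlinks, a few hundred bytes: the load scripts' cwd glob
    # `hits_*.parquet` now matches.
    for src in glob.glob(os.path.join(src_dir, "hits_*.parquet")):
        dst = os.path.join(cwd, os.path.basename(src))
        if not os.path.lexists(dst):
            os.symlink(src, dst)
```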
ClickHouse v26.x canonicalises the filesystem-cache path before the policy check that 'absolute path must lie inside /var/lib/clickhouse/caches/'; an older trick of pointing caches/web at /dev/shm via symlink is now rejected with BAD_ARGUMENTS at CREATE TABLE time. Bind-mount /dev/shm/clickhouse onto /var/lib/clickhouse/caches/web so the kernel-canonicalised path stays inside caches/ but the underlying bytes still live in tmpfs (the whole point — cold queries pull ~1 GB into the cache and we don't want that on the host SSD). Also clean up a leftover symlink from previous install runs before the mkdir/mount so re-running install is idempotent. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
First-boot initdb inside the cedardb container runs through 'Fixing permissions on existing directory' and 'Setting up database directory' phases that take 90-120 s on cold disk before postgres actually listens. The 60 s budget bailed during that window, leaving the system in start-failed and never snapshotted. pg_isready exits fast once the daemon is up, so the longer timeout only changes behaviour in the failure path. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…tch) Provisioning always starts on a fresh per-VM rootfs, so the prior symlink-cleanup + mountpoint guard added nothing and just made the script noisier. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
v2.5.12 has an Arrow type-inference bug under static-schema mode: incoming JSON integers are inferred as Float64 even when the row fits in Int64, and every /ingest with an Int64-declared field fails with 400 "Fail to merge schema field 'X' because the from data_type = Float64 does not equal Int64". The load script's parallel ingest loop hit this on the very first chunk and logged ~5000 'curl: (22) HTTP 400' lines while loading zero rows; queries then returned 0 for everything. Verified the fix locally: v2.7.2 accepts the bundled static_schema.json and the playground's hits.json shape — single row ingest returns 200, COUNT(*) and AVG(UserID) both produce the expected values. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…-all) The host's FORWARD policy is ACCEPT (Docker would flip it but we disable Docker's iptables management in the VM rootfs, and we don't want to flip the global policy ourselves — it would break unrelated host forwarding). disable_internet was only stripping the per-slot ACCEPTs and the POSTROUTING MASQUERADE, leaving every other packet to fall through to the default ACCEPT.

Practical exploit: a VM with arbitrary code execution exposed to the benchmark consumer (pandas, polars, dataframe variants) could curl 169.254.169.254/latest/api/token and get a real IMDSv2 token — the AWS hypervisor responds to the VM's RFC1918 source address even without our MASQUERADE rule, and the reply gets forwarded back the same way through the still-ACCEPT default policy. From there an attacker can read the EC2 instance role's credentials. Datalake systems are accidentally safe (the PREROUTING REDIRECT to the SNI proxy catches TCP/80 before FORWARD, and the proxy's Host-header allowlist rejects 169.254.169.254) but every other system was wide open.

Refactor: introduce _strip_slot(slot) that parses `iptables -S` output and removes every rule mentioning the slot's TAP or CIDR. Each enable/disable function calls it first, then installs its own rules — no more order-dependent interaction where a stale catch-all DROP from one mode silently blocks the next mode's ACCEPT. disable_filtered_internet is no longer needed (subsumed by _strip_slot) and goes away. disable_internet now installs explicit `-i tap -j DROP` and `-o tap -j DROP` so isolation no longer relies on the chain's default policy.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
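A sketch of _strip_slot's parse-and-delete loop; the slot is reduced to its TAP name and CIDR, and both the filter and nat tables are swept since the MASQUERADE lives in nat:

```python
import subprocess

def _strip_slot(tap: str, cidr: str) -> None:
    for table in ("filter", "nat"):
        listing = subprocess.run(["iptables", "-t", table, "-S"],
                                 capture_output=True, text=True,
                                 check=True).stdout
        for rule in listing.splitlines():
            if rule.startswith("-A") and (tap in rule or cidr in rule):
                # An "-A CHAIN <matchers>" line deletes as "-D CHAIN ...".
                subprocess.run(["iptables", "-t", table, "-D",
                                *rule.split()[1:]], check=True)
```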
The line `sys.stderr.write(f"{o[\"elapsed\"]}\n")` parses fine on
older Pythons that lex f-string contents textually but breaks on
Python 3.12+ where PEP 701 parses the brace contents as a real
expression — and a backslash inside a Python expression (outside a
string literal) is invalid, so every query failed with
"unexpected character after line continuation character" before
even reaching the server.
Drop the f-string for plain str() concatenation; no quote-nesting,
no version-dependent lexer quirk.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
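The replacement shape, wrapped in an illustrative function:

```python
import sys

def report_elapsed(o: dict) -> None:
    # Plain str() + concatenation: no quotes nested inside an f-string
    # expression, so the line lexes the same on every Python 3.x.
    sys.stderr.write(str(o["elapsed"]) + "\n")
```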
_build_images_if_needed short-circuits when both rootfs.ext4 and system.ext4 already exist, on the assumption that re-cloning costs disk for no benefit. That's wrong whenever base-rootfs.ext4 has been rebuilt since: the in-VM agent and the lib/download-* stubs live in the base, and the per-system scripts live in the sysdisk upper — and both stay stale. Concrete bite: today's agent change to stage partitioned parquet symlinks at cwd shipped in base-rootfs.ext4 at 18:05, but every already-provisioned partitioned system that we re-kicked afterwards (datafusion-partitioned and friends) booted off the pre-fix rootfs.ext4 from 15:39, ran the OLD agent that doesn't stage anything, and the load script's `mv hits_*.parquet partitioned/` matched zero files — leaving the parquet external-table empty and every query failing with 'No field named "EventDate"' / 'table hits not found'. Fix: compare mtimes; if base is newer, drop both the rootfs and the sysdisk so build-system-rootfs.sh runs and re-rsyncs both. On btrfs `cp --sparse=always` is a reflink — re-cloning a 200 GB sparse rootfs is near-instant, so the conservative invalidation isn't expensive. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
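A sketch of the invalidation check; names and paths are illustrative:

```python
import os

def images_stale(base: str, rootfs: str, sysdisk: str) -> bool:
    if not (os.path.exists(rootfs) and os.path.exists(sysdisk)):
        return True
    # Base rebuilt after the clones were made: both clones carry a stale
    # agent and stale stubs, so drop them and re-clone (reflink-cheap).
    return os.path.getmtime(base) > os.path.getmtime(rootfs)
```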
The README and architecture doc were conceptual; nothing walked through "from a blank Ubuntu 24.04 box to a serving playground". INSTALL.md does, in order: format btrfs + zstd, clone repo, set up sudoers, install firecracker/kernel/DNS/(optional)TLS, download datasets, build datasets image, build base rootfs, configure ClickHouse Cloud logging, start the server, provision the catalog. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The cross-system tooling (playground sweep, agent /query) keys off queries.sql by filename even when the contents aren't SQL. siglens ships SPL/Splunk QL but the file extension was producing NO_QUERIES misses in every catalog-wide sweep. Renaming aligns with every other system in the repo; the contents are unchanged, and benchmark.sh already declared BENCH_QUERIES_FILE accordingly (now matches reality, the override line is unnecessary but harmless). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two playground-agent behaviours used to be controlled by separate
mechanisms — an opaque .preserve-state file in the system dir for
"skip the pre-snapshot stop+start cycle" and nothing at all for
"force ./stop after snapshot restore". Both are now driven by
per-system variables in benchmark.sh, the same surface that
already exposes BENCH_DOWNLOAD_SCRIPT / BENCH_DURABLE /
BENCH_QUERIES_FILE.
PLAYGROUND_SKIP_RESTART_BEFORE_SNAPSHOT=yes
The loaded state lives only in the daemon's process memory
(pandas / polars / duckdb-dataframe / daft-parquet /
chdb-dataframe / polars-dataframe — and pinot / tidb which have
slow JVM/cluster bring-up worth snapshotting hot). Stopping
pre-snapshot would wipe the in-process DataFrame and the
restored snapshot would serve queries against a daemon whose
`hits = None`. Replaces the .preserve-state marker file.
PLAYGROUND_RESTART_AFTER_RESTORE_SNAPSHOT=yes
After a firecracker memory snapshot+restore the cluster's
internal connections (brpc, gossip) are stale; the system's
./start does a shallow health probe ("SELECT 1" against the
local node) and short-circuits, leaving the broken cross-node
connections in place — every subsequent query then fails with
"Connection refused" / "no available searcher nodes in the
cluster". byconity and quickwit both showed this; opting them
in causes the agent to force ./stop on btime shift before the
next ./start so the bring-up is from a clean state.
Agent reads the vars by grep, NOT by sourcing benchmark.sh (which
ends with `exec ../lib/benchmark-common.sh`). Both vars live next
to BENCH_DURABLE in the per-system shim, so the contract stays in
one file.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
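A sketch of that grep-style read; `_bench_var` is the name a later commit uses, the regex details are an assumption:

```python
import re

def _bench_var(path: str, name: str) -> str | None:
    # Never source benchmark.sh: its last line execs
    # ../lib/benchmark-common.sh and would run the benchmark.
    pat = re.compile(rf'^\s*(?:export\s+)?{re.escape(name)}=["\']?([^"\'\s]*)')
    with open(path) as fh:
        for line in fh:
            m = pat.match(line)
            if m:
                return m.group(1)
    return None

# e.g. _bench_var("benchmark.sh",
#                 "PLAYGROUND_SKIP_RESTART_BEFORE_SNAPSHOT") == "yes"
```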
…fix) When we set `iptables: false` in /etc/docker/daemon.json (to work around the missing kernel CONFIG_IP_NF_RAW on the firecracker guest kernel — Docker 28+'s DIRECT ACCESS FILTERING insists on the raw table), dockerd stopped installing its usual nat-table rule:

  -t nat -A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE

Container-originated packets then leave the VM with their docker0 source intact (172.17.0.x). The host's per-slot MASQUERADE matches only the VM TAP CIDR (10.200.X.0/24), so the 172.17.0.x packet exits ens1 unchanged and AWS drops it. Empirically: presto-datalake's load failed with `Name or service not known` for clickhouse-public-datasets.s3.eu-central-1.amazonaws.com, and cloudberry's install failed inside a Rocky Linux container with `Could not resolve host: mirrors.rockylinux.org`.

Replicate the missing rule via a small systemd unit that runs after docker.service. The nat table is intact (it's `raw` that isn't compiled in), so MASQUERADE works fine.

Also:
- cedardb / cedardb-parquet: bump start-ready timeout 300s → 600s (the container's initdb takes longer than 5 min on the cold sysdisk; this was the proximate cause of two HEALTHCHECK-TIMEOUT failures in the last sweep).
- trino-datalake / trino-datalake-partitioned: set BENCH_CHECK_TIMEOUT=1800. Trino's cold JVM bootstrap pushes past the lib's 300 s default, then keeps going for several more minutes; both variants timed out at the 900 s ./check budget.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The cedardb base variant got bumped to 600s in the last commit but cedardb-parquet still had the older 60s, so it would have hit the same HEALTHCHECK-TIMEOUT failure mode again. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The agent waited up to a hardcoded 900 s for ./check to succeed after ./start, regardless of what the per-system benchmark.sh declared. trino-datalake / trino-datalake-partitioned bumped BENCH_CHECK_TIMEOUT=1800 to cover Trino's cold-JVM bootstrap, but the agent ignored it and bailed at 900 s — exactly the "check did not succeed within 900s" we saw. Read the override via the same _bench_var() grep that handles PLAYGROUND_SKIP_RESTART_BEFORE_SNAPSHOT etc., and clamp to a floor of 900 s so the existing baseline still covers Druid / Pinot / similar JVM stacks that don't declare an override. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The upstream cap is sized for a bare-metal benchmark machine. Playground VMs have 16 GiB RAM total, so a 27 GB RAM tier overshoots physical memory; kinetica's rank-1 worker gets OOM-killed mid-LOAD and the load fails with `[GPUdb]executeSql: Internal_Error: Rank 1 non-responsive (Table:"ki_home.hits")`. Keeping 7 GiB of headroom for the agent, dockerd, and the rest of the kinetica plane keeps the load on the disk tier and the load completes. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Every OOM during ./load just printed psql:create.sql:109: ERROR: unable to allocate memory and we couldn't tell whether the agent's mkswap+swapon actually ran, whether the container saw the swap, or whether the sysctl tweaks (overcommit_memory=1, max_map_count, swappiness) stuck. With umbra in NEEDS_SWAP and a 256 GiB swap.raw attached, OOM shouldn't be possible — but it is, so dump enough state at the end of ./start that the next failure tells us where to look. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Trino's cold-start in the datalake configuration (hive catalog, S3 credentials shim) ran past the 1800s budget on the last provision. trino (non-datalake) and trino-partitioned snapshot fine on the same 900s default, so the slowdown is specific to the catalog/S3 config — give the cold path another 30 min and revisit with diagnostics if it still doesn't land. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Umbra's COPY consistently ENOMEMs ~9 min into the load on the default 16 GiB VM, even with NEEDS_SWAP (256 GiB swap.raw active, overcommit_memory=1, no docker cgroup memory cap). The diagnostic dump confirmed swap is mounted and the container's memory.max / memory.swap.max are 'max', so the kernel isn't the one refusing — umbra's own allocator hits a wall at the working- set peak before the swap path can catch up. Add VM_MEM_OVERRIDES_MIB in systems.py and have vm_manager pull mem_size_mib from it (falling back to the host's vm_mem_mib). Bump umbra to 32 GiB; the COPY then finishes, the snapshot carries the warm working set, and restored queries don't pay reload cost. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
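A sketch of the override table's shape; the umbra entry follows the commit, the lookup helper is illustrative:

```python
VM_MEM_OVERRIDES_MIB = {
    "umbra": 32 * 1024,  # COPY's working-set peak outruns 16 GiB + swap
}

def mem_size_mib(system: str, default_mib: int) -> int:
    return VM_MEM_OVERRIDES_MIB.get(system, default_mib)
```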
This reverts commit b9f4983.
Diagnostic dump from the last failure showed swap mounted, swap unused, container memory.max/swap.max both 'max'. The remaining hypothesis for the ENOMEM is umbra calling mlock() on a chunk bigger than the 8 MiB RLIMIT_MEMLOCK we explicitly set — mlock returns ENOMEM independent of how much swap is available, since locked pages by definition can't be paged out.

- Switch the docker --ulimit from memlock=8388608 to memlock=-1 (unlimited).
- Also dump vm.overcommit_memory / .swappiness / .max_map_count and the container's effective `ulimit -l` so the next failure conclusively tells us whether the sysctl tweaks stuck and what the container actually sees.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ntainer `./load` does `ln -f hits_*.parquet data/hits/` to populate the hive `external_location`. With the agent now staging partitioned parquet as symlinks at cwd pointing to /opt/clickbench/datasets_ro/hits_partitioned/hits_N.parquet, GNU ln's default behavior (`-P`) creates a hardlink to the SYMLINK inode rather than dereferencing — so `data/hits/hits_N.parquet` is a hardlink to a symlink whose target is an absolute host-VM path the container can't see. Inside the trino/presto container the symlinks all dangle, the hive external_location appears empty, and queries return 0 rows. Add `-v /opt/clickbench/datasets_ro:/opt/clickbench/datasets_ro:ro` to both containers so the absolute symlink targets resolve from inside the container too. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>